Machine Learning in Finance

Module 3: Introduction

Matthew G. Son

University of South Florida

Introduction to AI

AI, ML, DL

Artificial Intelligence

Artificial Intelligence (AI): machines that behave intelligently

These machines are called intelligent because:

  • They have decision-making capabilities

  • similar to those of human beings (i.e., they are “smart”)

Programming and ML

Machine Learning

Machine Learning (ML): A subset of AI

  • A technology that allows computer programs to learn patterns and rules by improving through experience

Algorithms make the computer learn

  • Some come from statistics, others from computer science

  • Neural networks (Deep Learning) are one such family of algorithms

Programming and ML

Types of ML

Types of ML

  1. Unsupervised Learning
    • cluster, associate, reduce dimensions
  2. (Semi) Supervised Learning
    • predict, forecast, classify
  3. Reinforcement Learning
    • Multi-stage decision making

Unsupervised Learning

Unsupervised Learning

Find patterns and “cluster” observations that share similar patterns, using only \(X\).

  • Principal Component Analysis (PCA)

    • Shrinks \(X\), a.k.a. dimensionality reduction
  • Hierarchical / K-means clustering

    • Clusters observations into \(K\) groups
  • Latent Dirichlet Allocation (LDA)

    • Topic modeling
  • Neural Network (e.g., autoencoder)
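As an illustrative sketch (not from the slides; it uses R's built-in iris data), k-means groups observations using only the X variables:

```r
# Unsupervised sketch: k-means clustering on R's built-in iris data.
# Only the predictors (X) are used; no ground-truth labels (Y).
X <- scale(iris[, 1:4])        # four numeric predictors, standardized
set.seed(1)
fit <- kmeans(X, centers = 3)  # cluster observations into K = 3 groups
table(fit$cluster)             # how many observations fall in each group
```

Standardizing the Xs first keeps a single variable's scale from dominating the distance calculation.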

Use of Unsupervised Learning

Unsupervised learning is useful for

  • Descriptive purpose (patterns)

  • Garnering insights from data

  • Dimensionality reduction (feature selection)

Unsupervised Learning in Finance:

  • Stock (Fund) clustering for portfolio

    • Cluster stocks that co-move
  • Clustering

    • Factor zoo: CAPM / Fama / Carhart / etc…
    • Country risk: Legal, Inflation, Peace etc…
  • Transaction Anomaly detection

  • Sentiment analysis (topic discovery)

  • Preprocessing step

Wordcloud

How do I know if it is Unsupervised?

Ask yourself whether the data has a ground truth (Y), aside from the predictor (X) variables.

  • Group investors by the trading patterns of their brokerage accounts

  • Cluster credit card transactions by time, location, amount, and frequency

Unsupervised Learning Types

Unsupervised learning problems can be grouped into three:

Clustering:

  • Row-wise groupings based on the predictor variables
  • Rule: use the Xs to form a grouping variable

Association:

  • Find frequent combinations of categorical variables
  • Item-wise groupings such as associating keywords to text
  • Rule: Xs to X (co-occurrence)

Dimensionality reduction:

  • Reduce variables to remove redundancy
  • Preprocessing step
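A minimal sketch of dimensionality reduction as a preprocessing step (illustrative, using the built-in iris data rather than course data):

```r
# Dimensionality reduction sketch: PCA compresses correlated Xs
pca <- prcomp(iris[, 1:4], scale. = TRUE)  # standardize, then rotate
summary(pca)              # proportion of variance explained per component
reduced <- pca$x[, 1:2]   # keep the first two components as new features
```

The reduced matrix can then feed a downstream model in place of the original, partly redundant predictors.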

Supervised Learning

Supervised Learning

The objective of a supervised learning method is:

For ground truth Y, find the best predictive model f given data X:

\[ Y \sim f(X) \]

  • Therefore, supervised learning requires Y!

Supervised Learning algorithms

  • Linear regressions

  • Logistic regressions (categorical)

  • Decision Trees

  • Boosted Trees

  • Neural network
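The first two algorithms above can be sketched in a few lines of R (illustrative, using the built-in mtcars data):

```r
# Supervised sketches on mtcars: both require a ground-truth Y
fit_reg <- lm(mpg ~ wt + hp, data = mtcars)   # linear regression (continuous Y)
fit_cls <- glm(am ~ wt + hp, data = mtcars,   # logistic regression
               family = binomial)             # (binary Y: 0/1 transmission type)
head(predict(fit_reg))                        # predicted fuel economy
head(predict(fit_cls, type = "response"))     # predicted probability of am = 1
```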

Regression and Classification

Supervised learning can be grouped into regression and classification problems.

When the predicted variable, Y, is:

  • Continuous variable: Regression

    • Stock return, Option price, volatility, earnings forecast (EPS), …
  • Discrete variable: Classification

    • Bank failure, Boom/bust, Positive/Negative, etc…

Quiz

Supervised / Unsupervised?

Regression / Classification, or Clustering / Association?

Predict Bond Price with given Fed rates

Supervised / Unsupervised?

Regression / Classification, or Clustering / Association?

Group observations by features X1 and X2

Supervised / Unsupervised?

Regression / Classification, or Clustering / Association?

Market basket analysis

Supervised / Unsupervised?

Regression / Classification, or Clustering / Association?

Whether an Image has human faces

Supervised / Unsupervised?

Regression / Classification, or Clustering / Association?

Topic modeling

Model Validation

Train-test split (2-part)

Simplest splitting scheme.

We need data to train/fit parameters for the specified model.

  • Also called “building a model”

Then we need to validate the model performance with “unseen” data.

  • Typical splits: 70/30 or 80/20
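A minimal sketch of a 70/30 split (illustrative, on the built-in mtcars data):

```r
# Train-test split: fit on 70% of the rows, validate on the unseen 30%
set.seed(42)
n <- nrow(mtcars)
idx <- sample(n, size = round(0.7 * n))          # random 70% of row indices
train_set <- mtcars[idx, ]
test_set  <- mtcars[-idx, ]
fit <- lm(mpg ~ wt, data = train_set)            # train on train_set only
mean((predict(fit, test_set) - test_set$mpg)^2)  # test MSE on unseen data
```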

Train-valid-test split (3-part)

ML operations involve the hyperparameter tuning process.

Hyperparameter

Hyperparameters are settings in the configuration of the model that are NOT LEARNED from the data but set prior to the training process. They govern both the model's performance and its training time.

Why additional split?

Without validation data, the test results can be biased toward a specific hyperparameter setting.

  • Set apart one more data for one step further validation

  • Typically 70/15/15

  • Pick the model with the best performance on the validation set, then do a final check on the test set

Cross validation (CV)

Instead of the 3-part split (train-valid-test), cross validation is a more robust technique to assess ML performance.

k-fold CV

https://scikit-learn.org/stable/modules/cross_validation.html

K-fold CV

  • Divide train data into k equally-sized folds

  • Each fold is once used as validation set

  • Average out the performance by each hyperparameter setting

  • Pick the best model

  • Final check on test (this hold-out set is optional)
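The steps above can be sketched in base R (illustrative; a single fixed model is used, so the search over hyperparameter settings is omitted):

```r
# k-fold CV sketch: each fold serves once as the validation set
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(mtcars)))  # assign rows to folds
cv_mse <- sapply(1:k, function(f) {
  fit <- lm(mpg ~ wt, data = mtcars[folds != f, ])    # train on k - 1 folds
  pred <- predict(fit, mtcars[folds == f, ])          # predict held-out fold
  mean((pred - mtcars$mpg[folds == f])^2)
})
mean(cv_mse)  # average validation error across the k folds
```

With several candidate hyperparameter settings, the same loop would run per setting, and the setting with the lowest average `cv_mse` would win.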

ML Performance

Accuracy of the model

How can we determine the “quality” of ML model?

What makes one ML model “better”?

It is accuracy, o’course!

Bias / Variance

Note

Unsupervised algorithms do not have a ground TRUTH against which to measure accuracy. Therefore, discussions of the bias/variance tradeoff are relevant only to supervised algorithms.

Two properties of accuracy

  1. Unbiasedness (less bias or error)
  2. Consistency (less variance)

Bias

Bias is the prediction error of a model, especially the prediction error on the training set.

  • Regression: \(\frac{1}{n}\sum\limits_{i = 1}^{n} (f(X_i) -Y_i)^2\)

  • Classification: \(\frac{1}{n}\sum\limits_{i = 1}^{n} 1_{f(X_i) \neq Y_i}\)

Low bias: predicted data points are close to the target

High bias: predicted data points are far from the target.
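The two formulas above can be computed directly (an illustrative sketch on the built-in mtcars data, not the course data):

```r
# Training-set bias, matching the two formulas above
fit_r <- lm(mpg ~ wt, data = mtcars)
mse <- mean((fitted(fit_r) - mtcars$mpg)^2)   # regression: mean squared error

fit_c <- glm(am ~ wt, data = mtcars, family = binomial)
pred <- as.numeric(fitted(fit_c) > 0.5)       # classify at a 0.5 cutoff
err <- mean(pred != mtcars$am)                # classification: error rate
c(mse = mse, error_rate = err)
```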

Bias in Regression

Based on the training set, we have three linear models:

  1. Linear model: high bias, due to the model's strong linearity assumption
  2. Quadratic model: lower bias, fits the data fairly well
  3. Higher-order polynomial: lowest bias, fits the data extremely well

Model complexity and accuracy

In general,

  • Higher complexity, higher accuracy

  • Higher complexity, lower interpretability (black-boxy)

Low bias, is that all?

If we attain low bias (high accuracy) from our train data in our model:

  • Should it work well with outside (new) data?

We care about the model's generalizability.

  • The model is of no use if it works well only on the train data!

Therefore:

  • We must test the model accuracy with new data

  • that is not used for training the model

Variance

Variance quantifies the sensitivity of the parameter estimates to fluctuations in the data.

It indicates how “reliable” the model is out of sample.

High variance: the model does not generalize well

Low variance: the model performance is reliable when outside data is given

Bias / Variance Tradeoff

Example : model1

  1. Linear model: low variance, the slope (parameter estimate) and accuracy of the model do not change much

Underfitting problem

It did not learn enough!

If we use the trained model for prediction on a new dataset:

  • Not impressive test accuracy

Example: model2

  1. Quadratic model: moderate variance, a moderate change in slope.

Example: model3

  1. Higher-order polynomial model: high variance, as the parameters change a lot with new data

Overfitting problem

It learned too much noise!

If we use the trained model for prediction on a new dataset:

  • A large drop in test accuracy compared to training

  • Classic overfitting problem

  • Not reliable on unseen data

Balanced model

Achieves good balance

If we use the trained model for prediction on a new dataset:

  • Less change in accuracy, similar to train accuracy

In a nutshell

Class exercise

Housekeeping

# install packages and load
library(tidyverse)

# read dataset
train <- read_csv("ml_data/table1_1.csv")
test <- read_csv("ml_data/table1_2.csv")

# plot
train |>
  ggplot(aes(x = Age, y = Salary)) +
  geom_point() +
  theme_bw()

Plot them

# Plot fitted line on top of train set
base_plot <- train |>
  ggplot(aes(x = Age, y = Salary)) +
  geom_point() +
  theme_bw()
plot_linear <- base_plot +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
plot_quad <- base_plot +
  geom_smooth(method = "lm", formula = y ~ poly(x, 2), se = FALSE)
plot_poly <- base_plot +
  geom_smooth(method = "lm", formula = y ~ poly(x, 5), se = FALSE)

plot_linear
plot_quad
plot_poly

Fit the model, compare accuracy

# model fit with linear regression
model_linear <- lm(Salary ~ Age, train)
model_quadratic <- lm(Salary ~ poly(Age, 2), train)
model_poly <- lm(Salary ~ poly(Age, 5), train)

What is the train accuracy (using \(R^2\) as metric)?

# These are train accuracy
summary(model_linear)$r.squared
[1] 0.541082
summary(model_quadratic)$r.squared
[1] 0.7987589
summary(model_poly)$r.squared # best
[1] 0.9691108

Test them

# Use trained model to predict on new data
test <- test |>
  mutate(
    Predicted_sal_1 = predict(model_linear, test),
    Predicted_sal_2 = predict(model_quadratic, test),
    Predicted_sal_3 = predict(model_poly, test)
  )
print(test)
# A tibble: 10 × 5
     Age Salary Predicted_sal_1 Predicted_sal_2 Predicted_sal_3
   <dbl>  <dbl>           <dbl>           <dbl>           <dbl>
 1    30 166000         165980.         160184.         117293.
 2    26  78000         150670.         113172.         111072.
 3    58 310000         273144.         271282.         245230.
 4    29 100000         162152.         149161.         104937.
 5    40 260000         204253.         243654.         284898.
 6    27 150000         154498.         125655.          99725.
 7    33 140000         177461.         190334.         173238.
 8    61 220000         284626.         260559.         257932.
 9    27  86000         154498.         125655.          99725.
10    48 276000         234871.         275396.         277161.

Calculate R-squared

\[ R^2 = 1- \frac{\sum(y_i - \hat{y_i})^2}{\sum(y_i - \bar{y})^2} = 1 - \frac{SSR}{SST} \]

# define r2
r2 <- function(predicted, truth) {
  residual_variance <- sum((truth - predicted)^2, na.rm = TRUE)
  total_variance <- sum((truth - mean(truth, na.rm = TRUE))^2, na.rm = TRUE)
  return(1 - (residual_variance / total_variance))
}

Calculate R2

r2(test$Predicted_sal_1, test$Salary)
[1] 0.5900583
r2(test$Predicted_sal_2, test$Salary) # best
[1] 0.8110253
r2(test$Predicted_sal_3, test$Salary) # large drop
[1] 0.7826997

Plotting fitted values on test

base_plot <- test |>
  ggplot(aes(x = Age)) +
  theme_bw() +
  geom_point(aes(y = Salary), color = "black")

model1_plot <- base_plot +
  geom_point(aes(y = Predicted_sal_1), color = "blue4") +
  geom_smooth(
    aes(y = Predicted_sal_1),
    method = "lm",
    formula = y ~ x,
    color = "blue4",
    se = FALSE
  )
model2_plot <- base_plot +
  geom_point(aes(y = Predicted_sal_2), color = "green4") +
  geom_smooth(
    aes(y = Predicted_sal_2),
    method = "lm",
    formula = y ~ poly(x, 2),
    color = "green4",
    se = FALSE
  )
model3_plot <- base_plot +
  geom_point(aes(y = Predicted_sal_3), color = "red4") +
  geom_smooth(
    aes(y = Predicted_sal_3),
    method = "lm",
    formula = y ~ poly(x, 5),
    color = "red4",
    se = FALSE
  )

model1_plot

model2_plot

model3_plot

Lab problem

Lab problem

Using the above train / test data, fit models with 3rd- and 4th-order polynomial regressions:

\[ Salary = \beta_0 + \beta_1Age + \beta_2Age^2 + \beta_3Age^3 + \epsilon \]

\[ Salary = \beta_0 + \beta_1Age + \beta_2Age^2 + \beta_3Age^3 + \beta_4Age^4 + \epsilon \]

Report train accuracy (\(R^2\)) and test accuracy. Which model performs better?

Visualize your work, then write it up with Quarto and render an .html report.

Homework Reading

Reading

  • John C. Hull “Machine Learning in Business”

    • Chapter 1